Statistical Inference IV

Chelsea Parlett-Pelleriti

Bayesian Uncertainty

Bayesian Statistics

Bayesianism’s main ideas:

  1. data \(X\) is fixed, and the parameters \(\theta\) of our process \(P_{\theta}\) are random

    • we imagine different parameter values that could exist
  2. inference relies on the idea of updating prior beliefs based on evidence from the data

  3. probabilities are used to quantify uncertainty we have about parameters

\[ \underbrace{p(\theta|d)}_\text{posterior} = \underbrace{\frac{p(d|\theta)}{p(d)}}_\text{update} \times \underbrace{p(\theta)}_\text{prior} \]

Bayesian Uncertainty

Bayesian Uncertainty

Bayesians think parameters are random. Frequentists think the data (and therefore intervals computed from them) are random.

Bayesians make probability statements about parameters. Frequentists make probability statements about intervals.

Bayes’ Rule

\[ P(A \mid B) = \frac{P(A\cap B)}{P(B)} = \frac{P(B \mid A) \cdot P(A)}{P(B)} \]

Bayes’ Rule is a way to calculate conditional probabilities

Bayes’ Rule

\[ P(\text{covid} \mid +) = \frac{P(+ \mid \text{covid}) \cdot P(\text{covid})}{P(+)} \]

What is the probability of having covid given that you got a positive covid test?

Bayes’ Rule

\[ P(\theta \mid \text{data}) = \frac{P(\text{data} \mid \theta) \cdot P(\theta)}{P(\text{data})} \]

“How did my theory change after seeing the data?”

Bayes’ Rule

\[ \color{#D55E00}{P(\theta \mid \text{data})} = \frac{P(\text{data} \mid \theta) \cdot P(\theta)}{P(\text{data})} \]

The Posterior is the probability of \(\theta\) after seeing the data.

Bayes’ Rule

\[ P(\theta \mid \text{data}) = \frac{P(\text{data} \mid \theta) \cdot \color{#CC79A7}{P(\theta)}}{P(\text{data})} \]

The Prior is the probability distribution of \(\theta\) before seeing the data.

Bayes’ Rule

\[ P(\theta \mid \text{data}) = \frac{\color{#F5C710}{P(\text{data} \mid \theta)} \cdot P(\theta)}{P(\text{data})} \]

The Likelihood is the probability of our data, given \(\theta\), evaluated at various different values of \(\theta\).


Bayes’ Rule

\[ P(\theta \mid \text{data}) = \frac{P(\text{data} \mid \theta) \cdot P(\theta)}{\color{#009E73}{P(\text{data})}} \]

The Normalizing Constant is the probability of our data. It normalizes the posterior so that it’s a valid probability (distribution).

It does not matter.

Bayes’ Rule

\[ P(\theta \mid \text{data}) = \frac{P(\text{data} \mid \theta) \cdot P(\theta)}{\color{#009E73}{P(\text{data})}} \]

The Normalizing Constant is the probability of our data. It normalizes the posterior so that it’s a valid probability (distribution).

The normalizing constant makes \(P(\theta \mid \text{data})\) a valid probability distribution (i.e. \(\int P(\theta \mid \text{data}) d \theta = 1\)) but, it’s just a scalar constant…so \(P(\text{data} \mid \theta) \cdot P(\theta) \propto P(\theta \mid \text{data})\) 👀

Bayes Rule and MCMC

\[ \left[P(\text{data} \mid \theta) \cdot P(\theta) \right] \propto P(\theta \mid \text{data}) \]

\[ \text{likelihood} \cdot \text{prior} \propto \text{posterior} \]

we have a function \(f(\theta) = P(\text{data} \mid \theta) \cdot P(\theta)\) that is proportional to the probability distribution \(p(\theta) = P(\theta \mid \text{data})\) we want to sample from, but \(f\) is not itself a proper probability distribution…

❓ What does that remind you of?
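This unnormalized product is exactly what MCMC consumes. Below is a minimal random-walk Metropolis sketch; the toy model (2 successes in 6 trials with a flat Beta(1, 1) prior, so the true posterior is Beta(3, 5)), the seed, and the step size are all illustrative choices of mine, not from the slides:

```python
import math
import random

def log_unnorm(theta):
    # Log of likelihood x prior for the toy model: 2 successes in 6
    # Bernoulli trials, flat Beta(1, 1) prior -> true posterior Beta(3, 5).
    if theta <= 0.0 or theta >= 1.0:
        return -math.inf
    return 2 * math.log(theta) + 4 * math.log(1 - theta)

def metropolis(n_draws, start=0.5, step=0.2, seed=1):
    # Random-walk Metropolis needs ONLY likelihood x prior:
    # the normalizing constant P(data) cancels in the acceptance ratio.
    rng = random.Random(seed)
    theta = start
    draws = []
    for _ in range(n_draws):
        proposal = theta + rng.uniform(-step, step)
        accept_prob = math.exp(min(0.0, log_unnorm(proposal) - log_unnorm(theta)))
        if rng.random() < accept_prob:
            theta = proposal
        draws.append(theta)
    return draws

draws = metropolis(20_000)
print(sum(draws) / len(draws))  # should land near E[Beta(3, 5)] = 3/8 = 0.375
```

Note that `log_unnorm` never divides by \(P(\text{data})\): the sampler sees only the unnormalized \(f(\theta)\).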

Bayes Rule and Parameter Estimates

Note: If we have draws from our posterior distribution \(P(\theta \mid \text{data})\), we can use those draws to calculate any statistic we want: the mean, median, or quantiles of the draws, or of transformations of the draws.


Bayes Rule by Hand

Flu Test

\[ P(\text{flu} \mid \text{+}) = \frac{P(\text{+} \mid \text{flu}) \cdot P(\text{flu})}{{P(\text{+})}} \]

  • \(P(\text{flu}) = 0.05\) (prevalence of flu)

  • \(P(\text{+} \mid \text{flu}) = 0.99\) (sensitivity of test)

  • \(P(\text{+} \mid \text{no flu}) = 0.1\) (1- specificity of test)

  • \(P(\text{+}) = \underbrace{P(\text{+} \mid \text{flu})\cdot P(\text{flu})}_\text{way 1}+ \underbrace{P(\text{+} \mid \text{no flu})\cdot P(\text{no flu})}_\text{way 2}\)

Math
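The math above can be checked by plugging in the numbers from the bullets (a sketch; the probabilities come straight from the slide):

```python
# Numbers from the slide: prevalence, sensitivity, false-positive rate
p_flu = 0.05
p_pos_given_flu = 0.99       # sensitivity
p_pos_given_no_flu = 0.10    # 1 - specificity

# Law of total probability: P(+) = way 1 + way 2
p_pos = p_pos_given_flu * p_flu + p_pos_given_no_flu * (1 - p_flu)

# Bayes' Rule
p_flu_given_pos = p_pos_given_flu * p_flu / p_pos
print(round(p_flu_given_pos, 3))  # 0.343
```

Despite the 99% sensitivity, the low prevalence drags the posterior probability down to about 34%.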

Bayes Rule by Hand

Beta-Binomial

We’re interested in estimating \(q\), the proportion of days it rains in California. It rained 12 of the last 365 days.

  • Binomial Likelihood: \(\mathcal{L}(q \mid x) = \binom{n}{x} q^{x} (1 - q)^{n - x}\)

  • Beta Prior: \(q \sim \text{Beta}(\alpha, \beta)\), with density \(p(q) = \frac{q^{\alpha-1} (1-q)^{\beta-1}}{B(\alpha,\beta)}\)

Beta Priors

Beta Prior Shiny App

Play around with the app for a minute, changing \(\alpha\) and \(\beta\), until you find a prior that looks reasonable to you.

  • \(\alpha\): “successes” (rain days)
  • \(\beta\): “failures” (no-rain days)

Bayes Rule by Hand

Beta-Binomial

We’re interested in estimating \(q\), the proportion of days it rains in California. It rained \(x = 12\) of the last \(n = 365\) days.

  • Binomial Likelihood: \(\mathcal{L}(q \mid x) = \binom{365}{12} q^{12} (1 - q)^{365-12}\)

  • Beta Prior: \(q \sim \text{Beta}(1, 9) = \frac{q^{1-1} (1-q)^{9-1}}{B(1,9)}\)

Remember: \(P(q \mid x) \propto \underbrace{\binom{365}{12} q^{12} (1 - q)^{365-12}}_\text{likelihood} \times \underbrace{\frac{q^{1-1} (1-q)^{9-1}}{B(1,9)}}_\text{prior}\)

Math
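A sketch of the math above: the Beta prior is conjugate to the Binomial likelihood, so the posterior is again a Beta, \(\text{Beta}(\alpha + x,\; \beta + n - x)\), with no integration needed:

```python
# Beta(1, 9) prior + Binomial(365, q) likelihood with x = 12 rainy days.
# Conjugacy: posterior = Beta(alpha + x, beta + n - x).
alpha, beta = 1, 9
n, x = 365, 12

post_alpha = alpha + x      # 13
post_beta = beta + n - x    # 362

# Posterior mean of a Beta(a, b) is a / (a + b)
post_mean = post_alpha / (post_alpha + post_beta)
print(round(post_mean, 4))  # 0.0347
```

The prior's "successes" and "failures" are simply added to the observed counts, which is why \(\alpha\) and \(\beta\) are often read as pseudo-data.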

Bayes Rule by Hand

Beta-Binomial

Bayesian Posteriors

Posteriors are inherently quantifications of uncertainty. They use a probability distribution to tell us the relative likelihood of different possible values of our parameter \(\theta\).

Bayesian Summaries

However, as great as they are, posterior distributions (or draws from a posterior) still have too much raw information to be useful.

Imagine handing your boss a posterior distribution in response to the question “how much more effective is blue font compared to black font”?

Bayesian Point Estimates

So, we still need summaries. In the Bayesian framework, we get point estimates by calculating statistics/summaries using our posterior.

E.g. the mean of the posterior, \(\mathbb{E}[\theta \mid x]\), or the median of the posterior.

In practice, we typically have samples from our posterior, \(p(\theta | x)\), not the distribution itself, but it’s actually easier to calculate summaries/statistics using samples! e.g.

\[ \frac{1}{n} \sum_{i=1}^n \underbrace{\theta_i}_{\text{posterior sample of } \theta} \]
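A sketch of summarizing draws, using (for illustration) draws from the \(\text{Beta}(13, 362)\) rain posterior — the conjugate update of the \(\text{Beta}(1, 9)\) prior with 12 rainy days in 365; in practice the draws would come from MCMC:

```python
import random
import statistics

# Illustrative posterior draws; with MCMC output, skip this step.
rng = random.Random(7)
draws = [rng.betavariate(13, 362) for _ in range(10_000)]

post_mean = statistics.mean(draws)      # estimate of E[theta | x]
post_median = statistics.median(draws)

# Summaries of TRANSFORMATIONS are just as easy: expected rainy days
# per year is E[365 * theta | x], estimated on the transformed draws.
rainy_days = statistics.mean(365 * q for q in draws)
print(post_mean, post_median, rainy_days)
```

No calculus required: every summary is an ordinary sample statistic computed on the draws.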

Credible Intervals

Credible Intervals are ranges of values \((lb, ub)\) that satisfy \(p(lb \leq \theta \leq ub) = c\), where \(c\) is a probability like 50%, 90%, 95%, etc.

Credible Interval Interpretation

  • There is a \(c\)% chance that \(\theta\) is between these values

literally that’s it…

CI vs. CI

  • Credible Interval (Bayesian): An interval within which the parameter \(\theta\) lies with a certain probability, given the observed data and the priors chosen.

  • Confidence Interval (Frequentist): An interval constructed so that under repeated sampling, a certain proportion of such intervals will contain the true parameter value \(\theta\).

Bayesian Interval Estimates: ETI

Equal Tailed Interval: choose an interval with \(c\)% of the mass of the posterior, with \(\frac{1-c}{2}\) of the mass in each of the upper and lower tails.

\[ P(\theta \leq lb \mid x) = \frac{1-c}{2} \\ P(\theta \geq ub \mid x) = \frac{1-c}{2} \]
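From draws, the ETI is just two quantiles. A sketch using illustrative \(\text{Beta}(13, 362)\) draws (the conjugate rain posterior from the Beta-Binomial example):

```python
import random
import statistics

rng = random.Random(7)
draws = [rng.betavariate(13, 362) for _ in range(10_000)]

# statistics.quantiles with n=40 returns cut points at 2.5%, 5%, ..., 97.5%,
# so the first and last give a 95% equal-tailed interval.
cuts = statistics.quantiles(draws, n=40)
lb, ub = cuts[0], cuts[-1]
print(lb, ub)
```

Each tail outside \([lb, ub]\) holds 2.5% of the draws, matching the two conditions above with \(c = 0.95\).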

Bayesian Interval Estimates: ETI

This looks a little different when you have a skewed distribution.

Bayesian Interval Estimates: ETI

What seems off here?

Bayesian Interval Estimates: HDI

Highest Density Interval: choose an interval that contains a specified probability mass, within which the density \(p(\theta \mid x)\) is higher than everywhere outside the interval.

For all \(\theta \in [lb, ub]\) and \(\theta' \notin [lb,ub]\), \(p(\theta \mid x) \geq p(\theta' \mid x)\)

Unlike the ETI, the HDI is not constrained to have equal tails.

Bayesian Interval Estimates: HDI

  • Step 1: Identify the density threshold \(k\) such that the set \(\{\theta : p(\theta \mid x) \geq k\}\) contains the desired probability mass \(c\).
  • Step 2: Define the interval \([lb, ub]\) that contains all \(\theta\) with \(p(\theta \mid x) \geq k\).
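For a unimodal posterior represented by draws rather than a density, the two steps above are commonly approximated by sliding a window over the sorted draws and keeping the narrowest one that holds a fraction \(c\) of the samples. A sketch, again with illustrative \(\text{Beta}(13, 362)\) draws:

```python
import random

def hdi_from_draws(draws, c=0.95):
    # Approximate HDI for a UNIMODAL posterior: among all windows of
    # sorted draws containing a fraction c of the samples, the narrowest
    # window is the one with the highest density inside it.
    s = sorted(draws)
    n_in = int(len(s) * c)
    widths = [(s[i + n_in - 1] - s[i], i) for i in range(len(s) - n_in + 1)]
    width, i = min(widths)  # narrowest window wins
    return s[i], s[i + n_in - 1]

rng = random.Random(7)
draws = [rng.betavariate(13, 362) for _ in range(10_000)]
lb, ub = hdi_from_draws(draws)
print(lb, ub)
```

For a multimodal posterior this single-window shortcut fails, since the true highest-density region can be a union of disjoint intervals.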


Bayesian Interval Estimates: HDI

❓What would an ETI look like here? Any issues with that? Any issues with this HDI?

From: https://vioshyvo.github.io/Bayesian_inference/summarizing-the-posterior-distribution.html

ETI vs. HDI

  • ETI: easy to calculate, splits the excluded values from the posterior equally between extreme highs and extreme lows, most similar to frequentist CIs

  • HDI: will always contain the mode(s) of the posterior, can account for asymmetry, and will be the narrowest interval for a given confidence level \(c\)%

Bayesian Interval Estimates: Custom

The nice thing about posteriors is that it’s easy to calculate any summary (point or interval) of the posterior that’s of interest to you.

  • We have the posterior for \(d\), the difference between the mean height of blondes and the mean height of brunettes (\(\mu_{bl} - \mu_{br}\)); what’s the probability that \(d < 0\)?

  • We have the posterior for \(p\) the payout you might get from the 1,000 lottery tickets you just bought, what would you win in the top 10% of scenarios you could expect to happen?

  • We have the posterior for \(t\), the amount of tips you expect to get from your hairdressing job. You only like going to work if you think you’ll make more than $200 in tips. What’s the probability that \(t \geq 200\)?

  • We have the posterior for \(c\), the click rate of your new email campaign. We want to know whether it’s equivalent to the current campaign, which has a click rate of 0.02. What’s the probability that \(0.01 \leq c \leq 0.03\) (which we’ve defined as practically equivalent)?
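Each of the questions above is a proportion of posterior draws. For example, the tips question, with hypothetical \(\text{Normal}(210, 30)\) posterior draws standing in for a fitted model (my illustrative numbers, not from the slides):

```python
import random

# Hypothetical posterior draws for t, nightly tips -- simulated here
# purely for illustration; in practice they'd come from your model.
rng = random.Random(7)
t_draws = [rng.gauss(210, 30) for _ in range(10_000)]

# "What's the probability that t >= 200?" = share of draws at or above 200.
p_good_night = sum(t >= 200 for t in t_draws) / len(t_draws)
print(p_good_night)
```

The same one-liner pattern answers \(P(d < 0)\), a top-10% payout quantile, or the equivalence probability for \(c\).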

Bayesian Interval Estimates: ROPE

Region of Practical Equivalence: an interval/range of values that are practically equivalent to no effect.

  • any change in depression scores \(\pm 0.25\) is clinically irrelevant
  • any click rate between \([0.01, 0.03]\) is practically equivalent to \(0.02\)
  • any regression coefficient that is \(0 \pm \frac{1}{10} sd\) means there’s no effect of that predictor

Smallest Effect Size of Interest: the smallest effect size that would be meaningful, clinically relevant, or impactful.

Bayesian Interval Estimates: ROPE

  1. Define a ROPE (use domain expertise or a “standard” small value like \(\frac{1}{10} sd\) )

  2. Calculate what % of your Posterior CI overlaps with ROPE

A lot of overlap is evidence for practical equivalence; little overlap is evidence for non-equivalence.
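A sketch of step 2 using the click-rate ROPE \([0.01, 0.03]\); the \(\text{Beta}(25, 1000)\) posterior draws are hypothetical, and I use the common shortcut of measuring the share of the whole posterior inside the ROPE rather than the CI overlap:

```python
import random

# Hypothetical posterior draws for the click rate c, simulated as
# Beta(25, 1000) (mean ~ 0.024) purely for illustration.
rng = random.Random(7)
c_draws = [rng.betavariate(25, 1000) for _ in range(10_000)]

rope = (0.01, 0.03)  # practically equivalent to the current 0.02 rate

# Share of the posterior inside the ROPE:
# high -> evidence of practical equivalence, low -> non-equivalence.
in_rope = sum(rope[0] <= c <= rope[1] for c in c_draws) / len(c_draws)
print(in_rope)
```

With most of this posterior's mass inside the ROPE, the new campaign would be judged practically equivalent to the old one.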
